Remittances Data - Exploratory Data, Model Construction, and Evaluation Final
Author
Cova, Langbehn, Villa & Barros
STEP 1: Setup and Data Loading
Load Required Packages
Code
library(tidymodels)
library(tidyverse)
library(janitor)
library(naniar)
library(assertr)
library(corrplot)
library(gridExtra)
library(reshape2)
library(glmnet)
library(dplyr)
library(vip)
library(ggplot2)

## Turn off scientific notation for readable numbers
options(scipen = 999)
1.1 Load the Data
We filter out rows with missing values in the outcome variable, since they would cause the models to fail and contribute no information. For missing values in the predictors, imputation helps ensure that non-systematic missingness does not adversely affect the sample size of our model.
## Get min, max, mean, median, quartiles
summary(remit_train$remittances)
Min. 1st Qu. Median Mean 3rd Qu. Max.
6038 70913036 532205864 2659137842 1989312698 137674533896
The mean ($2.66 billion) is much larger than the median ($532 million). The data is right-skewed.
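The mean-versus-median comparison can be turned into a quick numeric check. The sketch below is illustrative only: it uses simulated log-normal values as a stand-in for remittances (the simulation parameters are assumptions, not estimates from the data), and then computes the ratio implied by the summary statistics reported above.

```r
# Simulated right-skewed data; a stand-in for remittances, not the real values
set.seed(123)
x <- rlnorm(1000, meanlog = 20, sdlog = 2)

# In a right-skewed distribution the mean is pulled above the median
mean_x <- mean(x)
median_x <- median(x)
mean_to_median <- mean_x / median_x

# Ratio implied by the summary output above (mean / median)
reported_ratio <- 2659137842 / 532205864
reported_ratio
```

A ratio well above 1, as here, is a simple signal that a log transformation may be worth examining.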
Check overall data distribution
Code
remit_train |>select(remittances, remittances_gdp, gdp, stock, unemployment, deportations) |>pivot_longer(everything()) |>ggplot(aes(value)) +geom_histogram(bins =30) +facet_wrap(~name, scales ="free") +labs(title ="Distribution of Key Variables",caption ="Source: World Bank Remittances Data (1994-2024)" ) +theme_minimal()
Warning: Removed 2003 rows containing non-finite outside the scale range
(`stat_bin()`).
Point Plot to See Distribution
Code
remit_train |>ggplot(aes(remittances, 1)) +geom_point(alpha =0.2) +scale_y_continuous(breaks =0) +labs(y =NULL, title ="Distribution of Remittances",caption ="Source: World Bank Remittances Data (1994-2024)" ) +theme_bw() +theme(panel.border =element_blank())
Most points cluster on the left (lower values) with a few extreme points on the right (confirms right-skewness).
Histogram to See Frequency Distribution
Code
remit_train |>ggplot(aes(x = remittances)) +geom_histogram(bins =30, fill ="steelblue") +theme_minimal() +labs(title ="Distribution of Remittances",x ="Remittances (USD)",y ="Count",caption ="Source: World Bank Remittances Data (1994-2024)" )
The histogram is heavily concentrated on the left with a long tail to the right. This is classic right-skewed data.
Boxplot to Identify Outliers
Code
remit_train |>ggplot(aes(y = remittances)) +geom_boxplot(fill ="steelblue") +theme_minimal() +labs(title ="Boxplot of Remittances",y ="Remittances (USD)",caption ="Source: World Bank Remittances Data (1994-2024)" )
Many points appear above the upper whisker (outliers). These are likely large countries like Mexico that receive billions in remittances.
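To make "above the upper whisker" concrete, the following sketch applies the usual 1.5 × IQR rule to the quartiles reported in the remittances summary earlier. It assumes the boxplot uses the default whisker coefficient of 1.5; the hinges used by the plot may differ slightly from the quartiles printed by summary().

```r
# Quartiles and maximum taken from the summary(remit_train$remittances) output
q1 <- 70913036
q3 <- 1989312698
max_value <- 137674533896

# Interquartile range and the 1.5 * IQR fences
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr

# TRUE: the largest observation lies far beyond the upper fence
max_value > upper_fence
```

Because the lower fence is negative and remittances are strictly positive, every flagged point sits on the high side of the distribution.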
Examine GDP
Summary Statistics for GDP
Code
## Get summary statistics
summary(remit_train$gdp)
Min. 1st Qu. Median Mean 3rd Qu.
37184925 6202697120 25470203093 307632024591 169292646245
Max.
18743803170827
GDP also shows a huge range, from $37 million to $18.7 trillion.
Point Plot for GDP Distribution
Code
remit_train |>ggplot(aes(gdp, 1)) +geom_point(alpha =0.2) +scale_y_continuous(breaks =0) +labs(y =NULL, title ="Distribution of GDP",caption ="Source: World Bank Development Indicators (1994-2024)" ) +theme_bw() +theme(panel.border =element_blank())
GDP shows the same right-skewed pattern as remittances. Large economies have much higher GDP than small economies.
Examine Unemployment
Summary Statistics for Unemployment
Code
## Get summary statistics
summary(remit_train$unemployment)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.100 3.913 6.345 7.907 10.610 34.007 178
Unemployment ranges from 0.10% to 34.01%, with the middle half of observations between 3.91% and 10.61%.
Point Plot for Unemployment Distribution
Code
remit_train |>ggplot(aes(unemployment, 1)) +geom_point(alpha =0.2) +scale_y_continuous(breaks =0) +labs(y =NULL, title ="Distribution of Unemployment",caption ="Source: World Bank Development Indicators (1994-2024)" ) +theme_bw() +theme(panel.border =element_blank())
Warning: Removed 178 rows containing missing values or values outside the scale range
(`geom_point()`).
Unemployment appears more evenly distributed than remittances or GDP.
Examine Inflation
Summary Statistics for Inflation
Code
## Get summary statistics
summary(remit_train$inflation)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-32.741 1.782 4.144 10.368 8.567 2240.169 32
The maximum inflation is 2,240%, a clear outlier. The middle half of inflation values lies between about 1.8% and 8.6%.
Point Plot for Inflation Distribution
Code
remit_train |>ggplot(aes(inflation, 1)) +geom_point(alpha =0.2) +scale_y_continuous(breaks =0) +labs(y =NULL, title ="Distribution of Inflation",caption ="Source: World Bank Development Indicators (1994-2024)" ) +theme_bw() +theme(panel.border =element_blank())
Warning: Removed 32 rows containing missing values or values outside the scale range
(`geom_point()`).
2.8 Data Quality Unit Tests
Test 1: Are All Remittances Positive?
Code
## Test that remittances > 0 when not missing
remit_train |>
  filter(!is.na(remittances)) |>
  verify(remittances > 0) |>
  summarise(mean_remittances = mean(remittances, na.rm = TRUE))
# A tibble: 1 × 1
mean_remittances
<dbl>
1 2659137842.
All remittances are positive. The mean is $2.66 billion.
Test 2: Are All Years in Valid Range?
Code
## Test that years are between 1994 and 2024
remit_train |>
  verify(year >= 1994 & year <= 2024) |>
  summarise(mean_year = mean(year))
# A tibble: 1 × 1
mean_year
<dbl>
1 2010.
All years are within the expected range.
Test 3: Are All Unemployment Values Valid?
Code
## Test that unemployment is between 0 and 100
remit_train |>
  filter(!is.na(unemployment)) |>
  verify(unemployment >= 0 & unemployment <= 100) |>
  summarise(mean_unemployment = mean(unemployment, na.rm = TRUE))
# A tibble: 1 × 1
mean_unemployment
<dbl>
1 7.91
All unemployment values are valid percentages (0-100%).
2.9 Log Transformations for Skewed Variables
Since remittances and GDP are highly right-skewed, we create and examine log-transformed versions of both variables.
We add 1 before taking the log to handle any zero values (log(0) is undefined).
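A minimal sketch of the transformation follows. The toy vector is hypothetical (it reuses the minimum and maximum from the remittances summary plus a zero to show the edge case), and the commented mutate() call is an assumption about how columns such as log_remittances could be created, not the code actually used upstream.

```r
# Hypothetical toy values, including a zero, to show why 1 is added
remittances <- c(0, 6038, 532205864, 137674533896)

# log(x + 1) maps zero to zero instead of -Inf
log_plus_one <- log(remittances + 1)

# log1p() computes the same quantity and is more precise near zero
log1p_version <- log1p(remittances)

# expm1() reverses the transformation, e.g. to report predictions in USD
back_transformed <- expm1(log1p_version)

# Assumed dplyr equivalent for a data frame (not run here):
# remit_train <- remit_train |> mutate(log_remittances = log(remittances + 1))
round(log_plus_one, 3)
```

The second and fourth values reproduce the minimum (8.706) and maximum (25.648) shown in the log-remittances summary below, which is consistent with a log(x + 1) transformation.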
Examine Log-Transformed Remittances
Summary Statistics for Log Remittances
Code
## Get summary statistics for log remittances
summary(remit_train$log_remittances)
Min. 1st Qu. Median Mean 3rd Qu. Max.
8.706 18.077 20.093 19.725 21.411 25.648
Histogram of Log Remittances
Code
remit_train |>filter(!is.na(log_remittances)) |>ggplot(aes(x = log_remittances)) +geom_histogram(bins =30, fill ="darkgreen", color ="white") +theme_minimal() +labs(title ="Distribution of Log-Transformed Remittances",x ="Log(Remittances + 1)",y ="Count",caption ="Source: World Bank Remittances Data (1994-2024)" )
The log-transformed remittances show a much more normal distribution compared to the original right-skewed data.
Boxplot of Log Remittances
Code
remit_train |>filter(!is.na(log_remittances)) |>ggplot(aes(y = log_remittances)) +geom_boxplot(fill ="darkgreen") +theme_minimal() +labs(title ="Boxplot of Log-Transformed Remittances",y ="Log(Remittances + 1)",caption ="Source: World Bank Remittances Data (1994-2024)" )
Fewer outliers visible after log transformation.
Examine Log-Transformed GDP
Summary Statistics for Log GDP
Code
## Get summary statistics for log GDP
summary(remit_train$log_gdp)
Min. 1st Qu. Median Mean 3rd Qu. Max.
17.43 22.55 23.96 24.11 25.85 30.56
Histogram of Log GDP
Code
remit_train |>filter(!is.na(log_gdp)) |>ggplot(aes(x = log_gdp)) +geom_histogram(bins =30, fill ="darkblue", color ="white") +theme_minimal() +labs(title ="Distribution of Log-Transformed GDP",x ="Log(GDP + 1)",y ="Count",caption ="Source: World Bank Development Indicators (1994-2024)" )
Log GDP also shows a more normal distribution.
Compare Original vs Log-Transformed
Side-by-Side: Original vs Log Remittances
Code
## Create comparison plots
p1 <- remit_train |>
  filter(!is.na(remittances)) |>
  ggplot(aes(x = remittances)) +
  geom_histogram(bins = 30, fill = "steelblue") +
  theme_minimal() +
  labs(
    title = "Original Remittances (Right-Skewed)",
    x = "Remittances (USD)",
    caption = "Source: World Bank Development Indicators (1994-2024)"
  )

p2 <- remit_train |>
  filter(!is.na(log_remittances)) |>
  ggplot(aes(x = log_remittances)) +
  geom_histogram(bins = 30, fill = "darkgreen") +
  theme_minimal() +
  labs(
    title = "Log-Transformed Remittances (More Normal)",
    x = "Log(Remittances + 1)",
    caption = "Source: World Bank Development Indicators (1994-2024)"
  )

grid.arrange(p1, p2, ncol = 2)
The log transformation successfully converts the right-skewed distribution into a more normal distribution, which is better for modeling.
Side-by-Side: Original vs Log GDP
Code
## Create comparison plots for GDP
p3 <- remit_train |>
  filter(!is.na(gdp)) |>
  ggplot(aes(x = gdp)) +
  geom_histogram(bins = 30, fill = "steelblue") +
  theme_minimal() +
  labs(
    title = "Original GDP (Right-Skewed)",
    x = "GDP (USD)",
    caption = "Source: World Bank Development Indicators (1994-2024)"
  )

p4 <- remit_train |>
  filter(!is.na(log_gdp)) |>
  ggplot(aes(x = log_gdp)) +
  geom_histogram(bins = 30, fill = "darkblue") +
  theme_minimal() +
  labs(
    title = "Log-Transformed GDP (More Normal)",
    x = "Log(GDP + 1)",
    caption = "Source: World Bank Development Indicators (1994-2024)"
  )

grid.arrange(p3, p4, ncol = 2)
Relationship: Log GDP vs Log Remittances
Code
## Scatter plot with log-transformed variables
remit_train |>
  filter(!is.na(log_gdp), !is.na(log_remittances)) |>
  ggplot(aes(x = log_gdp, y = log_remittances)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  theme_minimal() +
  labs(
    title = "Log GDP vs Log Remittances",
    subtitle = "Clearer linear relationship after log transformation",
    x = "Log(GDP + 1)",
    y = "Log(Remittances + 1)",
    caption = "Source: World Bank Development Indicators (1994-2024)"
  )
`geom_smooth()` using formula = 'y ~ x'
The relationship between log GDP and log remittances is more linear than the original variables, which will improve model performance.
Assessing deportations and remittances
Code
ggplot(remit_train, aes(x = deportations +1, y = remittances +1)) +geom_point(alpha =0.3, color ="steelblue") +geom_smooth(method ="lm", se =TRUE, color ="red") +scale_x_log10() +scale_y_log10() +labs(title ="Log–Log Relationship Between Deportations and Remittances",x ="Log(Deportations + 1)",y ="Log(Remittances + 1)",caption ="Source: World Bank Development Indicators (1994-2024)") +theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 372 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 372 rows containing missing values or values outside the scale range
(`geom_point()`).
On a log–log scale, deportations and remittances show a positive but noisy association, suggesting remittances tend to rise with deportations.
2.10 Categorical Variables Analysis
Count Unique Countries
Code
## Count distinct country names
n_distinct(remit_train$country_name)
[1] 150
We have 150 different countries in the dataset.
View Frequency Table (First 20 Countries)
Code
## Show how many observations per country
head(table(remit_train$country_name), 20)
Most countries have between 20-30 observations, representing roughly 20-30 years of data.
Create Bar Chart of Top 20 Countries
Code
## Count observations per country and plot top 20
remit_train |>
  count(country_name, sort = TRUE) |>
  slice_head(n = 20) |>
  ggplot(aes(x = reorder(country_name, n), y = n)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Top 20 Countries by Number of Observations",
    x = "Country",
    y = "Count",
    caption = "Source: World Bank Development Indicators (1994-2024)"
  )
Countries are fairly evenly represented in the sample. Nepal has the most observations (30), while several countries have around 20-29 observations.
Create Bar Chart of Top 15 by total remittances (2024) & Top 15 by remittances/GDP (2024)
Code
remit_train %>%filter(year ==2024) %>%slice_max(remittances, n =15) %>%ggplot(aes(x =fct_reorder(country_name, remittances),y = remittances /1e9 )) +geom_col(fill ="steelblue") +coord_flip() +labs(title ="Top 15 Remittance Receivers (2024)",x ="Country",y ="Remittances (Billions USD)",caption ="Source: World Bank Development Indicators (1994-2024)") +theme_minimal()
Code
# Top 15 by remittances/GDP (2024)
remit_train %>%
  filter(year == 2024) %>%
  slice_max(remittances_gdp, n = 15) %>% # preferred over top_n()
  mutate(`Country Name` = fct_reorder(country_name, remittances_gdp)) %>%
  ggplot(aes(x = `Country Name`, y = remittances_gdp)) +
  geom_col(fill = "coral") +
  coord_flip() +
  labs(
    title = "Top 15 Countries: Remittances as % GDP (2024)",
    x = "Country",
    y = "Remittances (% GDP)",
    caption = "Source: World Bank Development Indicators (1994-2024)"
  ) +
  theme_minimal()
When considering just the past year of remittance flows, we see that India leads in amount of remittances received, whereas Nepal leads in remittances as a % of GDP.
# A tibble: 10 × 2
country_name mean_ratio
<chr> <dbl>
1 Lesotho 39.0
2 Tonga 31.2
3 Bermuda 21.0
4 Nepal 20.3
5 Lebanon 20.0
6 Samoa 19.2
7 El Salvador 18.2
8 Kosovo 16.8
9 Jordan 16.0
10 Jamaica 14.6
Code
top10 <- remit_train |>
  group_by(country_name) |>
  summarize(mean_ratio = mean(remittances_gdp, na.rm = TRUE)) |>
  arrange(desc(mean_ratio)) |>
  slice_head(n = 10)

ggplot(top10, aes(x = reorder(country_name, mean_ratio), y = mean_ratio)) +
  geom_col(fill = "coral") +
  coord_flip() +
  labs(
    title = "Top 10 Countries by Average Remittances % of GDP",
    x = "Country",
    y = "Average Remittances/GDP",
    caption = "Source: World Bank Development Indicators (1994-2024)"
  ) +
  theme_minimal()
Lesotho emerges as the country with the highest average remittances as a percentage of GDP, followed by Tonga.
Remittances per country (Log Scale)
Code
remit_train |>filter(country_name %in%c("Nicaragua", "El Salvador", "Honduras", "Guatemala", "Haiti", "India" )) |>ggplot(aes(x = year, y = remittances_gdp, color = country_name)) +geom_line(linewidth =1) +scale_y_log10() +labs(title ="Remittance Trends Over Time",subtitle ="Selected Countries (log scale)",x ="Year",y ="Remittances (% of GDP, log scale)",color ="Country",caption ="Source: World Bank Development Indicators (1994-2024)") +theme_minimal()
Start assessing countries of interest
Code
# Start assessing countries of interest
countries_of_interest <- c(
  "Nicaragua", "El Salvador", "Honduras",
  "Guatemala", "Haiti", "India"
)

filtered <- remit_train |>
  filter(country_name %in% countries_of_interest)

ggplot(filtered, aes(log(stock), log(remittances), color = country_name)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "log(stock)",
    y = "log(remittances)",
    title = "Stock–Remittance Relationship for Selected Countries",
    color = "Country",
    caption = "Source: World Bank Development Indicators (1994-2024)"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 11 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 11 rows containing missing values or values outside the scale range
(`geom_point()`).
Code
#Comment: comparing the stock–Remittance Relationship for Selected Countries
2.11 Relationships Between Variables
GDP vs Remittances (Original Scale)
Code
## Scatter plot with trend line
remit_train |>
  ggplot(aes(x = gdp, y = remittances)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  theme_minimal() +
  labs(
    title = "Relationship Between GDP and Remittances (Original Scale)",
    x = "GDP (USD)",
    y = "Remittances (USD)",
    caption = "Source: World Bank Development Indicators (1994-2024)"
  )
`geom_smooth()` using formula = 'y ~ x'
There is a clear positive relationship. Countries with larger economies (higher GDP) tend to receive more remittances in absolute dollar amounts. The red line shows the linear trend.
Log GDP vs Log Remittances (Better for Modeling)
Code
## Scatter plot with log-transformed variables
remit_train |>
  filter(!is.na(log_gdp), !is.na(log_remittances)) |>
  ggplot(aes(x = log_gdp, y = log_remittances)) +
  geom_point(alpha = 0.3, color = "darkgreen") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  theme_minimal() +
  labs(
    title = "Log GDP vs Log Remittances (Log Scale - Better Linear Fit)",
    subtitle = "This relationship is more appropriate for linear regression models",
    x = "Log(GDP + 1)",
    y = "Log(Remittances + 1)",
    caption = "Source: World Bank Development Indicators (1994-2024)"
  )
`geom_smooth()` using formula = 'y ~ x'
The log-transformed relationship is more linear and will produce better model predictions.
GDP Per Capita vs Remittances as % of GDP
Code
## Scatter plot with trend line
remit_train |>
  ggplot(aes(x = gdp_per, y = remittances_gdp)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  theme_minimal() +
  labs(
    title = "GDP Per Capita vs Remittances as % of GDP",
    x = "GDP Per Capita (USD)",
    y = "Remittances as % of GDP",
    caption = "Source: World Bank Development Indicators (1994-2024)"
  )
`geom_smooth()` using formula = 'y ~ x'
Poorer countries (lower GDP per capita) depend more heavily on remittances as a percentage of their economy. Richer countries receive remittances but they represent a smaller share of their total GDP.
Unemployment vs Remittances
Code
## Scatter plot with trend line
remit_train |>
  ggplot(aes(x = unemployment, y = remittances)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  theme_minimal() +
  labs(
    title = "Unemployment vs Remittances",
    x = "Unemployment Rate (%)",
    y = "Remittances (USD)",
    caption = "Source: World Bank Development Indicators (1994-2024)"
  )
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 178 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 178 rows containing missing values or values outside the scale range
(`geom_point()`).
There is a slight negative relationship (it is weak). Unemployment doesn’t appear to be a strong predictor of remittances.
Correlation Matrix (Numbers)
Code
## Calculate correlations between all numeric variables
remit_train |>
  select(where(is.numeric)) |>
  cor(use = "complete.obs") |>
  round(2)
gdp and remittances: 0.48 - Moderate positive (as GDP increases, remittances increase)
log_gdp and log_remittances: 0.54 - Stronger correlation
stock and deportations: 0.77 - Strong positive (might cause multicollinearity issues)
dist_pop and dist_cap: 1.00 - Perfect correlation! These measure essentially the same thing. We must drop one.
internet and year: 0.66 - Strong positive (internet access increases over time)
gdp_per and vulnerable_emp: -0.63 - Strong negative (richer countries have less vulnerable employment)
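Near-duplicate predictors such as dist_pop and dist_cap can also be flagged programmatically rather than by reading the matrix. The sketch below uses hypothetical toy data (two distance columns built to be almost identical, plus an unrelated column) and an assumed threshold of 0.95; both are illustrative choices, not values from the analysis.

```r
# Hypothetical toy data: dist_pop is dist_cap plus a little noise
set.seed(42)
n <- 200
dist_cap <- runif(n, 500, 15000)
toy <- data.frame(
  dist_cap = dist_cap,
  dist_pop = dist_cap + rnorm(n, sd = 1),
  internet = runif(n, 0, 100)
)

cor_mat <- cor(toy, use = "complete.obs")

# Blank the lower triangle and diagonal so each pair is reported once
cor_mat[lower.tri(cor_mat, diag = TRUE)] <- NA

# Positions of pairs whose absolute correlation exceeds the threshold
high <- which(abs(cor_mat) > 0.95, arr.ind = TRUE)

flagged <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
flagged
```

Only one variable from each flagged pair needs to enter the model; the recipes used later keep dist_cap and omit dist_pop.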
Code
## Create visual correlation matrix
# Create variable to represent numeric vars
numeric_vars_log <- remit_train %>%
  select(gdp, remittances_gdp, remittances, stock, gdp_per, deportations,
         vulnerable_emp, inflation, internet, dist_cap, terror) %>%
  na.omit()

cor_matrix_log <- cor(numeric_vars_log, use = "complete.obs")
melted_cor_log <- melt(cor_matrix_log)

ggplot(melted_cor_log, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), size = 3) +
  scale_fill_gradient2(
    low = "blue", mid = "white", high = "red",
    midpoint = 0, limits = c(-1, 1)
  ) +
  labs(
    title = "Correlation Heatmap (Log-Transformed Variables)",
    caption = "Source: World Bank Development Indicators (1994-2024)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Remittances are most strongly positively correlated with the stock of migrants (0.5), suggesting migrant presence is a key driver, while other macro variables show only weak associations.
Remittances as % of GDP are negatively correlated with GDP per capita (−0.49), indicating remittances matter more (relative to GDP) in poorer countries.
2.12 Time Trends Analysis
Remittances Over Time
Code
## Calculate average remittances per year and plot
remit_train |>
  group_by(year) |>
  summarise(avg_remittances = mean(remittances, na.rm = TRUE)) |>
  ggplot(aes(x = year, y = avg_remittances)) +
  geom_line(color = "steelblue", linewidth = 1) +
  geom_point(color = "steelblue") +
  theme_minimal() +
  labs(
    title = "Average Remittances Over Time (1994-2024)",
    x = "Year",
    y = "Average Remittances (USD)",
    caption = "Source: World Bank Development Indicators (1994-2024)"
  )
Remittances show a clear upward trend over 30 years. We can see:
Steady growth from 1994 to 2008.
A dip around 2008-2009 (global financial crisis).
Recovery and continued growth.
Another dip around 2020 (COVID-19 pandemic).
Strong rebound after 2020.
Remittances as % of GDP Over Time
Code
## Calculate average remittances as % of GDP per year and plot
remit_train |>
  group_by(year) |>
  summarise(avg_remittances_gdp = mean(remittances_gdp, na.rm = TRUE)) |>
  ggplot(aes(x = year, y = avg_remittances_gdp)) +
  geom_line(color = "steelblue", linewidth = 1) +
  geom_point(color = "steelblue") +
  theme_minimal() +
  labs(
    title = "Average Remittances as % of GDP Over Time",
    x = "Year",
    y = "Remittances as % of GDP",
    caption = "Source: World Bank Development Indicators (1994-2024)"
  )
Remittances as a percentage of GDP have stayed relatively stable around 3-4% over time. This means remittances are growing roughly in line with GDP growth, not becoming more or less important to economies over time.
2.13 Exploring Lagged Effects
Past changes in country-of-origin conditions are likely to be more informative about future remittances than current conditions. For example, while a downward shock in GDP may influence current migration, migrants may take time to settle into the US before they begin remitting back home.
These lagged effects seem most plausible for GDP, unemployment, terror, deportations, and changes in migrant stock (inward migration).
NOTE: This section is for EXPLORATION only. The actual lagged variables used in all models were created at the data loading stage (before the train/test split).
## Lagged predictors relationship with remittances (as % of GDP)
## GDP per capita
# Lagged vs Unlagged
remit_lag |>
  pivot_longer(
    cols = c(gdp_per, gdp_lag),
    names_to = "type",
    values_to = "value"
  ) |>
  ggplot(aes(value, remittances_gdp, color = type)) +
  geom_point(alpha = 0.3) +
  geom_smooth(se = FALSE) +
  theme_minimal() +
  labs(
    title = "Lagged vs Current GDP per Capita",
    x = "GDP per capita",
    color = "Variable",
    caption = "Source: World Bank Development Indicators (1994-2024)"
  )
Code
## Doesn't Necessarily Improve Model Fit.
Overall, there seem to be better ways to verify whether lagged variables would improve the interpretability of our models.
Code
## Comparing Lagged vs Current GDP per capita for key countries.
remit_lag |>
  filter(country_name %in% c("Nicaragua", "El Salvador", "Honduras",
                             "Guatemala", "Haiti", "India")) |>
  pivot_longer(
    cols = c(gdp_per, gdp_lag),
    names_to = "gdp_type",
    values_to = "gdp_value"
  ) |>
  ggplot(aes(x = gdp_value, y = remittances_gdp,
             color = country_name, linetype = gdp_type)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(
    title = "Lagged vs Current GDP per Capita",
    x = "GDP per capita (current or lagged)",
    linetype = "GDP variable",
    caption = "Source: World Bank Development Indicators (1994-2024)"
  )
The plot suggests that lagged GDP has a slightly stronger relationship with remittances and thus may improve model fit. Shocks or changes to prior GDP could therefore help explain current remittance amounts.
Lagged Unemployment
Code
## Comparing Lagged vs Current Unemployment for key countries.
remit_lag |>
  filter(country_name %in% c("Nicaragua", "El Salvador", "Honduras",
                             "Guatemala", "Haiti", "India")) |>
  pivot_longer(
    cols = c(unemployment, unemp_lag),
    names_to = "unemp_type",
    values_to = "unemp_value"
  ) |>
  ggplot(aes(x = unemp_value, y = remittances_gdp,
             color = country_name, linetype = unemp_type)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(
    title = "Lagged vs Current Unemployment",
    x = "Unemployment (current or lagged)",
    linetype = "Unemployment variable",
    caption = "Source: World Bank Development Indicators (1994-2024)"
  )
Lagged unemployment does not seem to alter the fitted relationship substantially for any country other than Haiti.
It likely won’t improve our model fit and thus shouldn’t be included.
Lagged Changes in Terror Level
Code
## Comparing Lagged vs Current Terror for key countries.
remit_lag |>
  filter(country_name %in% c("Nicaragua", "El Salvador", "Honduras",
                             "Guatemala", "Haiti", "India")) |>
  pivot_longer(
    cols = c(terror, terror_lag),
    names_to = "terror_type",
    values_to = "terror_value"
  ) |>
  ggplot(aes(x = terror_value, y = remittances_gdp,
             color = country_name, linetype = terror_type)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(
    title = "Lagged vs Current Terror",
    x = "Terror (current or lagged)",
    linetype = "Terror variable",
    caption = "Source: World Bank Development Indicators (1994-2024)"
  )
It seems terror varies less, and lagged terror levels may not be too explanatory.
Lagged Changes in Deportations
Code
## Comparing Lagged vs Current Deportations for key countries.
remit_lag |>
  filter(country_name %in% c("Nicaragua", "El Salvador", "Honduras",
                             "Guatemala", "Haiti", "India")) |>
  pivot_longer(
    cols = c(deportations, deportations_lag),
    names_to = "deportations_type",
    values_to = "deportations_value"
  ) |>
  ggplot(aes(x = deportations_value, y = remittances_gdp,
             color = country_name, linetype = deportations_type)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(
    title = "Lagged vs Current Deportations",
    x = "Deportations (current or lagged)",
    linetype = "Deportations variable",
    caption = "Source: World Bank Development Indicators (1994-2024)"
  )
For the key countries, lagged deportations show a much stronger relationship with future remittances than current deportations do.
Code
## Comparing Lagged vs Current Migrant Stock for key countries.
remit_lag |>
  filter(country_name %in% c("Nicaragua", "El Salvador", "Honduras",
                             "Guatemala", "Haiti", "India")) |>
  pivot_longer(
    cols = c(stock, stock_lag),
    names_to = "stock_type",
    values_to = "stock_value"
  ) |>
  ggplot(aes(x = stock_value, y = remittances_gdp,
             color = country_name, linetype = stock_type)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(
    title = "Lagged vs Current Changes in Migrant Stock",
    x = "Migrant Stock (current or lagged)",
    linetype = "Migrant Stock variable",
    caption = "Source: World Bank Development Indicators (1994-2024)"
  )
The change is weaker here, so lagged migrant stock probably doesn’t warrant inclusion.
Takeaways:
Lagged deportations and lagged GDP improve model fit the most (they shift the slopes of our relationships most).
For the other predictors (changes in migrant stock, terror, and unemployment), the relationships for our key countries barely changed, indicating no gain in explanatory power.
Next Steps: using step_mutate in our recipe to add lags to our gdp_per and deportations would account for this.
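In panel data, a lag has to be computed within each country so that the first year of one country does not inherit the last year of another. The sketch below shows one way to do this in base R on a hypothetical toy panel; the one-year lag length is an assumption, and the column names simply mirror those used in this document.

```r
# Hypothetical toy panel; column names mirror the ones used in this document
panel <- data.frame(
  country_name = c("A", "A", "A", "B", "B", "B"),
  year         = c(2001, 2002, 2003, 2001, 2002, 2003),
  gdp_per      = c(100, 110, 120, 500, 490, 510)
)

# Sort so that "previous row" means "previous year within the same country"
panel <- panel[order(panel$country_name, panel$year), ]

# One-year lag computed separately per country; each country's first year is NA
panel$gdp_lag <- ave(
  panel$gdp_per,
  panel$country_name,
  FUN = function(x) c(NA, head(x, -1))
)

# Assumed dplyr equivalent (not run here):
# panel |> group_by(country_name) |> mutate(gdp_lag = lag(gdp_per)) |> ungroup()
panel
```

Creating the lag before the train/test split, as noted earlier in this section, keeps each row's lagged value tied to the correct prior year even if that prior year ends up in the other partition.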
STEP 3: Model Development
3.1 Cross-validation set up
To improve the accuracy of our estimated error rates, we set up a 10-fold cross validation with 5 repetitions since we have a relatively small number of observations within the training data. We create a recipe for the baseline model and process the full training data using parameter specification.
Code
remit_folds <-vfold_cv(data = remit_train, v =10, repeats =5)
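To illustrate what 10 folds repeated 5 times means in practice, the following base R sketch builds the fold assignments by hand. The sample size is a hypothetical placeholder (the real value is nrow(remit_train)), and this is only a conceptual stand-in for what vfold_cv() manages internally.

```r
# Hypothetical training-set size; the real value is nrow(remit_train)
n_obs   <- 1000
v       <- 10
repeats <- 5

set.seed(2024)
# For each repeat, shuffle a balanced vector of fold labels 1..v
fold_ids <- replicate(repeats, sample(rep(seq_len(v), length.out = n_obs)))

# Each candidate model is fit once per fold per repeat
n_resamples <- v * repeats

# Size of each held-out fold in the first repeat
holdout_per_fold <- table(fold_ids[, 1])
n_resamples
```

The 50 resamples correspond to the n = 50 column in the baseline metrics reported in this section.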
3.2 Baseline Recipe
Code
## Recipe (baseline model)
recipe_baseline <- recipe(
  remittances_gdp ~ stock + gdp_per + gdp_lag + unemployment + dist_cap +
    terror + deportations + deportations_lag + internet + inflation,
  data = remit_train
) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_mutate(gdp_per = log(gdp_per + 1)) |>
  step_normalize(all_numeric_predictors())

## Processing the full training data using parameter specification.
bake(prep(recipe_baseline, training = remit_train), new_data = remit_train)
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 6.47 50 0.153 Preprocessor1_Model1
2 rsq standard 0.0762 50 0.00346 Preprocessor1_Model1
The linear model establishes a baseline performance using ordinary least squares regression with no regularization. This gives us a benchmark to compare other models against.
The model
4.2 Prepare Enhanced Recipe for Regularized Models
Code
# We clean remittances before folding due to missing values
train_data2 <- remit_train %>%
  filter(!is.na(remittances_gdp))

remit_folds <- vfold_cv(train_data2, v = 10)

# At first, glmnet was throwing errors, so we create a recipe that forces
# the predictors into a form that ridge/lasso (glmnet) can handle: no NA,
# no Inf / -Inf, no constant columns, and comparable scales across predictors.
recipe_glmnet <- recipe_baseline %>%
  step_mutate(across(all_numeric_predictors(),
                     ~ if_else(is.finite(.x), .x, NA_real_))) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_mode(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

# This control object makes errors visible and traceable
ctrl <- control_grid(save_pred = TRUE, verbose = TRUE)
grid30 <- grid_regular(penalty(), levels = 30)
metrics1 <- metric_set(rmse)
4.3 RIDGE Regression
Code
# This defines ridge regression with a tuned penalty.
ridge_spec <- linear_reg(penalty = tune(), mixture = 0) %>%
  set_mode("regression") %>%
  set_engine("glmnet")

# Build the workflow (recipe + model)
ridge_wf <- workflow() %>%
  add_recipe(recipe_glmnet) %>%
  add_model(ridge_spec)

# We tune the penalty using cross-validation
ridge_res <- tune_grid(
  ridge_wf,
  resamples = remit_folds,
  grid = grid30,
  metrics = metrics1,
  control = ctrl
)

best_ridge <- select_best(ridge_res, metric = "rmse")
final_ridge_wf <- finalize_workflow(ridge_wf, best_ridge)
ridge_fit <- fit(final_ridge_wf, data = train_data2)

# Results
best_ridge
Penalty value with the lowest RMSE: best_ridge has a penalty ~ 0. The penalty effectively disappears, so the model becomes almost identical to standard linear regression. The cross-validation procedure shows that adding regularization does not improve predictive performance relative to an unpenalized linear model.
ridge_fit shows that ridge ≈ OLS: the coefficients will be very similar to those of the baseline linear model.
Conclusion:
The ridge regression regularization indicated increasing deviance as the penalty decreased, with cross-validation selecting a penalty effectively equal to zero. This suggests that the unpenalized linear model already provides an optimal fit for the data.
4.4 LASSO Regression
Code
# We specify the LASSO model
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_mode("regression") %>%
  set_engine("glmnet")

# We build the workflow - preprocess to prevent leakage
lasso_wf <- workflow() %>%
  add_recipe(recipe_glmnet) %>%
  add_model(lasso_spec)

# Tune lambda penalty using cross-validation
lasso_res <- tune_grid(
  lasso_wf,
  resamples = remit_folds,
  grid = grid30,
  metrics = metrics1,
  control = ctrl
)

# Choose the best lambda (lowest RMSE)
best_lasso <- select_best(lasso_res, metric = "rmse")

# Finalize the workflow
final_lasso_wf <- finalize_workflow(lasso_wf, best_lasso)

# Fit the final LASSO model on all training data
lasso_fit <- fit(final_lasso_wf, data = train_data2)
best_lasso
# A tibble: 2 × 4
model penalty mean std_err
<chr> <dbl> <dbl> <dbl>
1 ridge 0.204 6.46 0.370
2 lasso 0.0418 6.46 0.369
Comment
Cross-validation selected a non-zero penalty for the LASSO model, indicating that it improves predictive performance. The LASSO regularization path shows a small set of predictors entering the model as the penalty decreases, highlighting its role as a variable selection method.
Comment on selected predictors
The LASSO model selected a set of nine predictors. Remittance intensity is negatively associated with GDP per capita and macroeconomic instability, while unemployment, deportations, and internet access exhibit positive relationships, consistent with counter-cyclical and transaction-cost mechanisms.
Why LASSO is useful:
The dataset benefits from variable selection (LASSO), not from coefficient stabilization (RIDGE).
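For reference, the difference between the two penalties can be written out using the elastic-net objective that glmnet minimizes for a Gaussian response. Here $\lambda$ corresponds to the penalty argument and $\alpha$ to the mixture argument in the model specifications above; $N$, $y_i$, $x_i$, $\beta_0$ and $\beta$ are generic notation introduced for this sketch.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
\;+\;
\lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\right]
```

With $\alpha = 0$ (ridge) only the squared $\ell_2$ term remains, which shrinks coefficients toward zero without removing any. With $\alpha = 1$ (LASSO) only the $\ell_1$ term remains, which can set some coefficients exactly to zero and therefore acts as variable selection.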
4.5 Random Forest
We use a random forest to reduce variance error by averaging many trees, which limits the influence of less important predictors. It is a bagging-based algorithm that considers only a random subset of predictors at each split. The key hyperparameters are mtry, the number of predictors sampled at each split (it can be tuned to an optimal value), and min_n, the minimum number of observations a node needs before it is split further; the number of trees is tuned as well.
Code
# For missing values inside the dependent variable
remit_train_clean <- remit_train |>
  filter(!is.na(remittances_gdp))

# Smaller CV for run time optimization
rf_folds <- vfold_cv(remit_train_clean, v = 5)

## Given the high degree of missingness, the recipe uses median imputation so that NAs do not cause the model to fail.
recipe_alt <- recipe(
  remittances_gdp ~ stock + gdp_per + gdp_lag + unemployment + dist_cap + terror + deportations + deportations_lag + internet + inflation + country_name,
  data = remit_train_clean
) |>
  update_role(country_name, new_role = "id") |>
  step_impute_median(all_numeric_predictors()) |>
  step_mutate(gdp_per = log(gdp_per + 1)) |>
  step_normalize(all_numeric_predictors())

bake(prep(recipe_alt, training = remit_train_clean), new_data = remit_train_clean)

## Creating a Random Forest model set up for tuning
rf_mod <- rand_forest(trees = tune(), mtry = tune(), min_n = tune()) |>
  set_mode(mode = "regression") |>
  set_engine(engine = "ranger", importance = "impurity", num.threads = 4)

## Creating a workflow. We use the alternative recipe because it handles missingness, which would otherwise cause the model to fail.
rf_wf <- workflow() |>
  add_recipe(recipe_alt) |>
  add_model(rf_mod)

## Finalize parameters
rf_params <- rf_wf |>
  extract_parameter_set_dials() |>
  finalize(remit_train_clean)
rf_params

## Tuning grid
rf_grid <- grid_max_entropy(rf_params, size = 20)

## Tuning within cross-validation using our hyperparameters
rf_tuned <- rf_wf |>
  tune_grid(
    resamples = rf_folds,
    grid = rf_grid,
    control = control_grid(save_pred = TRUE)
  )

## Measuring the RMSE
rf_tuned |>
  collect_metrics()
# A tibble: 40 × 9
mtry trees min_n .metric .estimator mean n std_err .config
<int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 20 1868 19 rmse standard 3.77 5 0.537 Preprocessor1_Model…
2 20 1868 19 rsq standard 0.718 5 0.0420 Preprocessor1_Model…
3 17 1054 3 rmse standard 3.41 5 0.545 Preprocessor1_Model…
4 17 1054 3 rsq standard 0.772 5 0.0437 Preprocessor1_Model…
5 9 1913 23 rmse standard 3.87 5 0.540 Preprocessor1_Model…
6 9 1913 23 rsq standard 0.703 5 0.0406 Preprocessor1_Model…
7 15 1724 3 rmse standard 3.41 5 0.540 Preprocessor1_Model…
8 15 1724 3 rsq standard 0.772 5 0.0428 Preprocessor1_Model…
9 16 1797 30 rmse standard 4.04 5 0.527 Preprocessor1_Model…
10 16 1797 30 rsq standard 0.671 5 0.0371 Preprocessor1_Model…
# ℹ 30 more rows
Code
rf_tuned |>show_best(metric ="rmse", n =10)
# A tibble: 10 × 9
mtry trees min_n .metric .estimator mean n std_err .config
<int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 15 1724 3 rmse standard 3.41 5 0.540 Preprocessor1_Model…
2 17 1054 3 rmse standard 3.41 5 0.545 Preprocessor1_Model…
3 14 319 8 rmse standard 3.50 5 0.525 Preprocessor1_Model…
4 14 1185 14 rmse standard 3.66 5 0.541 Preprocessor1_Model…
5 3 1582 7 rmse standard 3.73 5 0.589 Preprocessor1_Model…
6 20 1868 19 rmse standard 3.77 5 0.537 Preprocessor1_Model…
7 6 12 6 rmse standard 3.82 5 0.461 Preprocessor1_Model…
8 22 1048 22 rmse standard 3.85 5 0.526 Preprocessor1_Model…
9 2 737 4 rmse standard 3.85 5 0.594 Preprocessor1_Model…
10 9 1913 23 rmse standard 3.87 5 0.540 Preprocessor1_Model…
Code
## Select the best specification and fit it to the full training data
best_rf <- rf_tuned |>
  select_best(metric = "rmse")

final_rf_wf <- rf_wf |>
  finalize_workflow(best_rf)

final_rf_fit <- final_rf_wf |>
  fit(data = remit_train_clean)

# Variable importance
final_rf_fit |>
  extract_fit_parsnip() |>
  vip(num_features = 10)
4.6 K-Nearest Neighbors (KNN)
KNN predicts remittances by averaging the values of the K most similar country-year observations.
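A minimal base R sketch of this idea follows. It assumes Euclidean distance on already-normalized predictors and an unweighted average; the function name knn_predict and the toy values are hypothetical and are not taken from the fitted workflow.

```r
# Predict one new observation as the mean outcome of its k nearest neighbors
knn_predict <- function(train_x, train_y, new_x, k) {
  # Euclidean distance from new_x to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  nearest <- order(dists)[seq_len(k)]
  mean(train_y[nearest])
}

# Hypothetical, already-normalized predictors (rows = country-years)
train_x <- matrix(c(0.0, 0.1, 0.2, 2.0,   # first predictor
                    0.0, 0.1, 0.3, 2.0),  # second predictor
                  ncol = 2)
# Hypothetical outcome: remittances as % of GDP
train_y <- c(2.0, 2.4, 3.1, 18.0)

pred <- knn_predict(train_x, train_y, new_x = c(0.05, 0.05), k = 2)
print(pred)
```

Because distances drive the prediction, predictors on very different scales would let one variable dominate, which is why normalization is assumed here.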
## Finalize & fit KNN on all training data
best_k <- select_best(knn_results, metric = "rmse")

final_knn_wf <- finalize_workflow(knn_workflow, best_k)

final_knn_fit <- final_knn_wf |>
  fit(data = remit_train_clean_knn)
Code
## Predict on test
test_predictions <- final_knn_fit |>
  augment(new_data = remit_test_clean_knn)
# A tibble: 7 × 6
country_name RMSE MAE R2 Avg_Remittances_GDP n
<chr> <dbl> <dbl> <dbl> <dbl> <int>
1 Mexico 0.102 0.0944 0.996 2.00 4
2 India 0.446 0.389 1 3.39 2
3 Haiti 1.66 1.60 0.978 14.1 4
4 Nicaragua 2.60 1.83 0.604 10.3 5
5 Honduras 5.43 4.07 0.221 18.7 5
6 El Salvador 5.49 3.14 0.0815 19.7 4
7 Guatemala 6.73 5.75 1 13.6 2
Code
avg_perf_interest
# A tibble: 1 × 4
RMSE MAE R2 n
<dbl> <dbl> <dbl> <int>
1 3.94 2.35 0.719 26
Code
## Visualize actual vs. predicted
test_predictions |>
  filter(country_name %in% countries_of_interest) |>
  ggplot(aes(x = remittances_gdp, y = .pred)) +
  geom_point(alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  facet_wrap(~ country_name) +
  labs(
    title = "KNN: Test Set Performance by Country",
    subtitle = "Points near the diagonal = better predictions",
    x = "Actual Remittances (% GDP)",
    y = "Predicted",
    caption = "Source: World Bank Development Indicators (1994-2024)"
  ) +
  theme_minimal()
5.3 Key Insights and Policy Implications
Insights
- Strong performance for large, diversified economies: Mexico and India show very low prediction error (RMSE < 0.5) and very high explanatory power (R² ≈ 1). The model performs best where remittances respond predictably to macroeconomic fundamentals and migration stocks. India's R² rests on only two test observations (n = 2), so its RMSE is the more informative figure.
- Moderate accuracy for mid-dependence countries: Haiti and Nicaragua exhibit moderate RMSE (≈ 1.6–2.6) and reasonably strong fit.
- Weak performance for highly remittance-dependent economies: Guatemala, El Salvador, and Honduras have the highest RMSE values (≈ 5–7), despite being central to U.S. migration and deportation policy discussions. El Salvador and Honduras also have low R²; Guatemala's reported R² of 1 comes from n = 2 and should not be read as a good fit.
- These countries rely heavily on remittances (≈ 14–20% of GDP), making flows more sensitive to household-level shocks not captured by the model.
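The per-country table reports an R2 of exactly 1 for India and Guatemala, each with n = 2. The sketch below shows why that value can appear even when errors are large, under the assumption that R2 was computed as the squared correlation between actual and predicted values (as yardstick's rsq() does). The two data points are hypothetical, not the report's test observations.

```r
rmse <- function(actual, pred) sqrt(mean((pred - actual)^2))
mae <- function(actual, pred) mean(abs(pred - actual))
# Assumption: R2 defined as squared correlation, not 1 - SSE/SST
rsq_cor <- function(actual, pred) cor(actual, pred)^2

# Hypothetical country with two test years and badly underpredicted values
actual <- c(13.0, 14.2)
pred <- c(6.5, 7.2)

metrics <- c(
  RMSE = rmse(actual, pred),
  MAE = mae(actual, pred),
  R2 = rsq_cor(actual, pred)
)
print(round(metrics, 3))
```

Any two points with distinct values lie on a straight line, so the squared correlation is 1 regardless of how far the predictions sit from the actual values. For countries with n = 2, RMSE and MAE are the figures to rely on.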
Code
## Plot residuals
residuals_df <- test_predictions |>
  filter(country_name %in% countries_of_interest) |>
  mutate(residual = .pred - remittances_gdp)

ggplot(residuals_df, aes(x = year, y = residual)) +
  geom_point(alpha = 0.6) +
  geom_smooth(alpha = 0.4, se = FALSE) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  facet_wrap(~ country_name, scales = "free_y") +
  labs(
    title = "KNN Residuals Over Time by Country",
    subtitle = "Positive values = overprediction; negative = underprediction",
    x = "Year",
    y = "Residual (Predicted − Actual)",
    caption = "Source: World Bank Development Indicators (1994-2024)"
  ) +
  theme_minimal()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: span too small. fewer data values than degrees of freedom.
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: pseudoinverse used at 2000.9
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: reciprocal condition number 0
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
: There are other near singularities as well. 170.96
Interpretation: Residuals Over Time by Country
- Large, diversified economies (Mexico, India)
  - Residuals are tightly centered around zero with very small magnitudes.
  - No clear time trend or structural bias is evident.
  - This reinforces that the KNN model captures remittance dynamics well when flows scale predictably with macroeconomic and migration variables.
- Highly remittance-dependent Central American countries show systematic error
  - El Salvador and Honduras exhibit large, persistent negative residuals in later years, indicating systematic underprediction.
  - These patterns suggest the model fails to capture structural increases in remittance dependence over time.
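One way to put a number on "systematic" error is the mean residual per country, using the same sign convention as the plot (predicted minus actual). The sketch below uses a small hypothetical data frame with placeholder country labels rather than the report's residuals_df.

```r
# Hypothetical residuals (predicted - actual); values are illustrative only
residuals_df <- data.frame(
  country_name = c("A", "A", "A", "B", "B", "B"),
  residual = c(0.1, -0.1, 0.0, -4.0, -5.5, -6.0)
)

# Mean residual per country: near zero = unbiased, negative = underprediction
bias <- aggregate(residual ~ country_name, data = residuals_df, FUN = mean)
print(bias)
```

A mean residual far from zero, relative to the spread of the residuals, points to bias rather than noise; with only a handful of test years per country, such summaries remain indicative at best.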